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Abstract 

Background: De Brujin graphs are widely used in bioinfornnatics for processing next-generation sequencing data. 
Due to a very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several 
approaches to this problem have been proposed recently. 

Results: In this work, we show how to reduce the memory required by the data structure of Ghikhi and Rizk (WABI'l 2) 
that represents de Brujin graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to 
their method, with insignificant impact on construction time. At the same time, our experiments showed a better 
query time compared to the method of Ghikhi and Rizk. 

Conclusion: The proposed data structure constitutes, to our knowledge, currently the most efficient practical 
representation of de Bruijn graphs. 

Keywords: Next-generation sequencing. Genome assembly, de Brujin graph. Bloom filter 



Background 

Modern next-generation sequencing (NGS) technologies 
generate huge volumes of short nucleotide sequences 
(reads) drawn from a DNA sample under study. The 
length of a read varies from 35 to about 400 base pairs 
(letters) and the number of reads may be hundreds of mil- 
lions, thus the total volume of data may reach tens or even 
hundreds of Gb. 

Many computational tools dealing with NGS data, espe- 
cially those devoted to genome assembly, are based on the 
concept of a de Bruijn graph, see e.g. [1]. Nodes of a de 
Bruijn graph^ correspond to all distinct k-mers occurring 
in the given set of reads, and two k-mers are linked by 
an arc if they have a suffix-prefix overlap of size /c — 1. 
The value of k is an open parameter that in practice is 
chosen between 20 and 64. The idea of using de Bruijn 
graphs for genome assembly goes back to the "pre-NGS 
era" [2] . Note, however, that de novo genome assembly is 
not the only application of those graphs when dealing with 
NGS data. There are several others, including: de novo 

^Correspondence: Gregory.Kucherov(5)univ- mlv.fr 

^Department of Computer Science, Ben-Gurion University of the Negev, Be'er 
Sheva, Israel 

^Laboratoire d'Informatique Gaspard Monge, Universite Paris-Est & CNRS, 
Marne-la-Vallee, Paris, France 

Full list of author information is available at the end of the article 



transcriptome assembly [3] and de novo alternative splic- 
ing calling [4] from transcriptomic NGS data (RNA-seq); 
metagenome assembly [5] from metagenomic NGS data; 
and genomic variant detection [6] from genomic NGS 
data using a reference genome. 

Due to a very large size of NGS datasets, it is essential 
to represent de Bruijn graphs as compactly as possible. 
This has been a very active line of research. Recently, sev- 
eral papers have been published that propose different 
approaches to compressing de Bruijn graphs [7-11]. 

Conway and Bromage [7] proposed a method based on 
classical succinct data structures, i.e. bitmaps with effi- 
cient rank/select operations. On the same direction, Bowe 
et al. [10] proposed an interesting succinct representa- 
tion that, assuming only one string (read) is present, uses 
only 4£ bits, where E is the number of arcs in the graph. 
The more realistic case where there are R reads can be 
easily reduced to the one string case by concatenating 
all R reads using a special separator character. In this 
case, however, the size of the structure is 4£ + 0(7? log £) 
bits ([10], Theorem 1). Since the multiplicative constant 
of the second term is difficult to evaluate, it is hard to 
know precisely what would be the size of this structure in 
practice. 
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Ye et al [8] proposed a different method based on a 
sparse representation of de Bruijn graphs, where only a 
subset of k-mers present in the dataset are stored. Using 
the Bloom filter data structure, Pell et al [11] proposed 
probabilistic de Bruijn graphs as a compact approximate 
representation of full de Bruijn graphs. Finally, Chikhi and 
Rizk [9] improved Pells scheme in order to obtain an exact 
representation of the de Bruijn graph. 

A direct application of Bloom filters to de Bruijn graphs, 
studied in [11], results in a space-efficient representation 
at the price of allowing one-sided errors^ namely false pos- 
itive nodes {k-mevs). The method of [9] removes these 
errors and proposes a space-efficient data structure for 
the exact representation of de Bruijn graphs. The method 
is based on the following idea. In the genome assembly 
application, de Bruijn graphs are only used for traversal, 
and random accesses to graph nodes are never performed. 
If all queried nodes {k-mevs) are only those which are 
reachable from some node known to belong to the graph, 
then only a fraction of all false positives can actually occur. 
Storing these false positives explicitly leads to an exact 
(false positive free) and space-efficient representation of 
the de Bruijn graph. This is the best practical exact repre- 
sentation of de Bruijn graphs for the purpose of genome 
assembly, now implemented in MiNiA software [15]. 

Our main contribution is an improvement of the 
method of [9] by changing the representation of the set 
of false positives. We achieve this by iteratively applying a 
Bloom filter to represent the set of false positives, then the 
set of "false false positives" etc. We show analytically that 
this cascade of Bloom filters allows for a considerable fur- 
ther economy of memory, improving the method of [9]. 
Depending on the value of ky our method requires 30% to 
40% less memory with respect to the method of [9]. More- 
over, with our method, the memory grows very little as k 
grows. Finally, we implemented our method and tested it 
against [9] on real datasets. The tests confirm the theoret- 
ical predictions for the size of structure and show a 20% to 
30% improvement in query times. 

Preliminaries 

A Bloom filter is a space-efficient data structure for repre- 
senting a given subset of elements T ^ U, with support 
for efficient membership queries with one-sided error. 
That is, if a query for an element x e U returns no then 
X ^ but if it returns yes then either x e or, with 
small probability, x ^ T (false positive). A Bloom filter 
consists of a bitmap (array of bits) B of size m and a set 
of p distinct hash functions {/zi, . . . , /z^}, where hi : U 
{0, . . . , m — 1}. Initially, all bits of B are set to 0. An inser- 
tion of an element x e T is done by setting the bits of 
B with indices hiix), . . . , hp{x) to 1, i.e. B[hi{x)\ = 1 for 
all i e[l,p]. Membership queries are done symmetrically, 
returning yes if all B[hi(x)] are equal 1 and no otherwise. 



As shown in [12], when considering hash functions that 
yield equally likely positions in the bitmap, and for large 
enough bitmap size m and number of inserted elements 
the false positive rate T is 

jr ^ (1 _ e-P'^/'^y = (1 - e-P^y (1) 

where r = m/n is the number of bits (of the bitmap B) per 
element (of T represented). It is not hard to see that this 
expression is minimized when p = r In 2, giving the false 
positive rate 

jr ^ (1 _ g-ln2^rln2 ^ ^1/2)'^^ ^ o.6185^ (2) 

A de Bruijn graph, for a given parameter /c, of a set of 
reads (strings) 7^ c E* = {A, C, T, G}* is entirely defined 
by the set T c = of /c-mers present in IZ. The nodes 
of the graph are precisely the /c-mers of T and for any two 
vertices w, v g T, there is an arc from u to v if the suffix 
of u of size (k — 1) is equal to the prefix of v of the same 
size. Thus, given a set T c of /c-mers, we can represent 
its de Bruijn graph using a Bloom filter B. This approach 
has the disadvantage of having false positive nodes, as a 
direct consequence of false positives in the Bloom filter, 
which can create false connections in the graph (see [11] 
for the influence of false positive nodes on the topology 
of the graph). The naive way to remove those false pos- 
itives nodes by explicitly storing (e.g. using a hash table) 
the set of all false positives of B is clearly inefficient, as 
the expected number of elements to be explicitly stored is 

The key idea of [9] is to explicitly store only a small sub- 
set of all false positives of B, the so-called critical false 
positives. Consider a /c-mer u that belongs to T, u has at 
most 2|E| = 8 potential neighbors, i.e. /c-mers overlap- 
ping uhy (k — 1) letters. The set of critical false positives 
consists of the potential neighbors of A"-mers of T that are 
false positives of B. This set is, in general, much smaller 
than the set of all false positives of B, its expected size can 
be upper-bounded by 8|r|J^. On the other hand, storing 
the set of critical false positives is clearly sufficient to rep- 
resent the de Bruijn graph if one only wants to support 
graph traversal, i.e. navigation from a node of the graph 
to its neighbors. In this case, only potential neighbors of 
nodes in T are queried. 

Cascading Bloom filter 

Let 7^ be a set of reads and To be the set of occurring 
/c-mers (nodes of the de Brujin graph) that we want to 
store. As stated in Section "Preliminaries", the method 
of [9] stores Tq via a bitmap Bi using a Bloom filter, 
together with the set Ti of critical false positives. Ti con- 
sists of potential neighbors of Tq which are stored in Bi 
"by mistake", i.e. belong^ to Bi but not to Tq. Bi and Ti 
are sufficient to represent the graph provided that the only 
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queried k-mers are those which are potential neighbors of 
/c-mers of Tq. 

The idea we introduce in this work is to use this struc- 
ture recursively and represent the set Ti by a new bitmap 
B2 and a new set then represent T2 by B3 and Ts, and 
so on. More formally, starting from Bi and Ti defined as 
above, we define a series of bitmaps Bi,B2, . . . and a series 
of sets Ti, 22, . . . as follows. B2 stores the set of false pos- 
itives Ti using another Bloom filter, and T2 contains the 
critical false positives of ^2) i-e. true positives from Tq that 
are stored in B2 "by mistake" (we call them false false pos- 
itives). B3 and Is, and, generally, Bi and Ti are defined 
similarly: Bi stores /c-mers of Ti-i using a Bloom filter, 
and Ti contains /c-mers stored in Bi "by mistake", i.e. those 
k-mers in Bi that do not belong to Ti-i but belong to 
T/_2. Observe that H Ti = 0, To 2 3^2 2 T4... and 

The following lemma shows that the construction is cor- 
rect, that is it allows one to verify whether or not a given 
/c-mer belongs to the set Tq, 

Lemma 1. Given a k-mer (node) u, consider the small- 
est / such that u ^ Bi-^i (if u ^ Bi,we define / = 0). Then, 
if / is odd, then u e Tq^ and if / is even (including 0), then 
u ^ Tq. 

Proof Observe that u ^ implies u ^ Ti by the basic 
property of Bloom filters that membership queries have 
one-sided error, i.e. there are no false negatives. We first 
check the Lemma for / = 0, 1. 

For / = 0, we have u ^ B\, and then u ^ Tq, 

For / = 1, we have u e Bi but w ^ ^2- The latter implies 
that u ^ T\y and then u must be a false false positive, that 
is M G Tq, Note that here we use the fact that the only 
queried /c-mers u are either nodes of Tq or their potential 
neighbors in the graph (see [9]), and therefore iiu e Bi 
and u ^ To then u e Ti, 

For the general case / > 2, we show by induction that 
u G Ti-i, Indeed, u e Bi (1 . . . (1 Bi implies u g Ti-i U Ti 
(which, again, is easily seen by induction), and u ^ Bi-^i 
implies u ^ Ti, 

Since Ti-i c Tq for odd and Ti-i c Ti for even / (for 
To n Ti = 0), the lemma follows. □ 

Naturally, Lemma 1 provides an algorithm to check if a 
given k-mer u belongs to the graph: it suffices to check 
successively if it belongs to B\,B2, . . . until we encounter 
the first which does not contain u. Then, the answer 
will simply depend on whether / is even or odd: u belongs 
to the graph if and only if / is odd. 

In our reasoning so far, we assumed an infinite num- 
ber of bitmaps Bi. Of course, in practice we cannot store 
infinitely many (and even simply many) bitmaps. There- 
fore, we truncate the construction at some step t and 



store a finite set of bitmaps B\,B2, . . .>Bt together with an 
explicit representation of Tt. The procedure of Lemma 1 
is extended in the obvious way: if for all 1 < / < 
u e Bi, then the answer is determined by directly checking 
u e Tt. 

Analysis of the data structure 

Memory and time usage 

First, we estimate the memory needed by our data struc- 
ture, under the assumption of an infinite number of 
bitmaps. Let N be the number of true positives, i.e. | | = 
N. As stated in Section "Preliminaries", if Tq has to be 
stored via a bitmap Bi of size rA/, the false positive rate can 
be estimated as c^, where c = 0.6185. And, the expected 
number of critical false positive nodes (set Ti) has been 
estimated in [9] to be 8Nc^, as every node has eight exten- 
sions, i.e. potential neighbors in the graph. We slightly 
refine this estimation to 6Nc^ by noticing that for most of 
the graph nodes, two out of these eight extensions belong 
to Tq (are real nodes) and thus only six are potential false 
positives. Furthermore, to store these 6Nc^ critical false 
positive nodes, we use a bitmap B2 of size 6rNc^, Bitmap 
B3 is used for storing nodes of Tq which are stored in B2 
"by mistake" (set T2). We estimate the number of these 
nodes as the fraction (false positive rate of filter B2) of 
N (size of To), that is Nc^, Similarly, the number of nodes 
we need to put to B4, is 6Nc^ multiplied by c^, i.e. 6Nc^^, 
Keeping counting in this way, the memory needed for the 
whole structure is rN + 6rNc^ + rNc^ + 6rNc^^ + rNc^^ + . . . 
bits. The number of bits per k-mer is then 

r-\-6rc''-\-rc''-\-6rc^''-\-. . . = (r + 6rc0(l + c'' + c^'' + . . .) 

= (l + 6c0-^. (3) 

A simple calculation shows that the minimum of this 
expression is achieved when r = 5.464, and then the 
minimum memory used per k-mer is 8.45 bits. 

As mentioned earlier, in practice we store only a finite 
number of bitmaps Biy...yBt together with an explicit 
representation (such as array or hash table) of Tt. In this 
case, the memory taken by the bitmaps is a truncated sum 
rN + GrNd" + rNd" + .., and a data structure storing Tt 
takes either 2k • Nc^'^'^^ or 2k • 6Nc^^'^^ bits, depending 
on whether t is even or odd. The latter follows from the 
observations that we need to store Nc^2'^^ (or GNc^'^'^^) k- 
mers, each taking 2k bits of memory. Consequently, we 
have to adjust the optimal value of r minimizing the total 
space, and re-estimate the resulting space spent on one 
k-mer. 

Table 1 shows estimations for optimal values of r and the 
corresponding space per k-mer for ^ = 4 and t = 6, and 
several values of k. The data demonstrates that even such 
small values of t lead to considerable memory savings. It 
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Table 1 1 st column: ^-mer size; 2nd and 4th columns: optimal value of r for Bloom filters (bitmap size per number of 
stored elements) for t = 4 and t = 6 respectively; 3rd and 5th columns: the resulting space per ^-mer (for t = 4 and 
t = 6); 6th column: space per ^-mer for the method of [9] (t = 1 ) 



k 


Optimal r 


Bits per /c-mer 


Optimal r 


Bits per /c-mer 


Bits per /c-mer 




for t = 4 


for t = 4 


for t = 6 


for t = 6 


fort=1([9]) 


16 


5.777 


8.556 


5.506 


8.459 


12.078 


32 


6.049 


8.664 


5.556 


8.470 


13.518 


64 


6.399 


8.824 


5.641 


8.490 


14.958 


128 


6.819 


9.045 


5.772 


8.524 


16.398 



appears that the space per /c-mer is very close to the opti- 
mal space (8.45 bits) obtained for the infinite number of 
filters. Table 1 reveals another advantage of our improve- 
ment: the number of bits per stored /c-mer remains almost 
constant for different values of k. 

The last column of Table 1 shows the memory usage 
of the original method of [9], obtained using the esti- 
mation (^1.44 log2 (^^^ +2.08^ the authors provided. 
Note that according to that estimation, doubling the 
value of k results in a memory increment by 1.44 bits, 
whereas in our method the increment is of 0.11 to 0.22 
bits. 

Let us now comment on query and preprocessing times 
for our scheme. The query time can be split into two parts: 
the time spent on querying t Bloom filters and the time 
spent on querying Tt, As stated in Section "Preliminaries", 
each query in a Bloom filter corresponds to/? = r In 2 hash 
functions evaluations. Clearly, the total query time for t 
Bloom filters is tp = S (tr). Thus, it is expected that using t 
Bloom filters, even if r decreases, the query time increases. 
For instance, with ^ = 4 we have that r = 6.049 (/c = 32) 
and the total number of hash function evaluations is pro- 
portional to rt ^ 24, whereas with ^ = 1 we have that 
r = 11.44 and rt ^ 12, a factor 2 increase in the num- 
ber of hash function evaluations. On the other hand, the 
set Tt is generally much smaller than Ti, due to the above- 
mentioned exponential decrease. Depending on the data 
structure for storing T^, the time saving in querying Tt vs. 
Ti may even dominate the time loss in querying multi- 
ple Bloom filters. Our experimental results (Section "Con- 
struction algorithm" below) confirm that this situation 
does indeed occur in practice. Note that even in the case 
when querying Tt weakly depends on its size (e.g. when Tt 
is implemented by a hash table), the query time will not 
increase much, due to our choice of a small value for ^, as 
discussed earlier. 

At the preprocessing step, we need to construct Bloom 
filters B\,...yBt and set Tt. At each stage /, we need to 
store Ti-\ and Ti-2 (possibly on disk, if we want to save on 
the internal memory used by the algorithm) to construct 
Bi and Ti. A key observation is that the sizes oiBi and Ti 
decrease exponentially on / and therefore the time spent 



to construct the whole structure is a linear function on the 
size of Tq. In particular, asymptotically it is only a small 
constant factor larger compared to the original method of 
[9]. If the value of t is small (such as ^ = 4, as in Table 1), 
the preprocessing time is obviously even smaller. 

Using different values of r for different filters 

In the previous section, we assumed that each of our 
Bloom filters uses the same value of r, the ratio of bitmap 
size to the number of stored /c-mers. However, formula 
(3) for the number of bits per k-mev shows a difference 
for odd and even filter indices. This suggests that using 
different parameters r for different filters, rather than the 
same for all filters, may reduce the space even further. If 
Vi denotes the corresponding ratio for filter Bi^ then (3) 
should be rewritten to 

n + 6r2d' + nd^ + 6r4C^i+"3 + . . . , (4) 

and the minimum value of this expression becomes 7.93 
(this value is achieved with ri = 4.41; r/ = 1.44, / > 1). 

In the same way, we can use different values of r/ in the 
truncated case. This leads to a small 2% to 4% improve- 
ment in comparison with the case of unique value of r. 
Table 2 shows results for the case ^ = 4 for different values 
ofy^. 

Query distribution among filters 

The query algorithm of Lemma 1 simply queries Bloom 
filters Biy...yBt successively as long as the returned 

Table 2 Estimated memory occupation for the case of 
different values of r vs. single value of r (shown in Table 1 ), 



for 4 Bloom filters (t = 4) 


k Optimal ri,r2,r3,r4 


Bits per /c-mer 
different values of r 


Bits per /c-mer 
single value of r 


16 5.254,3.541,4.981,8.653 


8.336 


8.556 


32 5.383,3.899,5.318,9.108 


8.404 


8.664 


64 5.572,4.452,5.681,9.108 


8.512 


8.824 


128 5.786,5.108,6.109,9.109 


8.669 


9.045 
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answer is positive. The query time then directly depends 
on the number of filters applied before getting a nega- 
tive answer. Therefore, it is instructive to analyse how the 
query frequencies to different filters are distributed when 
performing a graph traversal. We provide such an analysis 
in this section. 

We analyse query frequencies during an exhaustive 
traversal of the de Bruijn graph, when each true node is 
visited exactly once. We assume that each time a true node 
is visited, all its eight potential neighbors are queried, as 
there is no other way to tell which of those neighbors are 
real. Note however that this assumption does not take into 
account structural properties of the de Bruijn graph, nor 
any additional statistical properties of the genome (such 
as genomic word frequencies). 

For a filter Bi, we want to estimate the number of 
queried /c-mers resolved by Bi during the traversal, that is 
queries on which Bi returns no. This number is the dif- 
ference of the number of queries submitted to Bi and the 
number of queries for which Bi returns yes. Note that the 
queries submitted to Bi are precisely those on which the 
previous filter Bi-i returns yes. 

If the input set Tq contains N /c-mers, then the num- 
ber of queries in a graph traversal is SAT, since for each 
true node each of its 8 potential neighbors are queried. 
Moreover, about 2N queries correspond to true /c-mers, 
as we assume that most of the graph nodes have two true 
neighbors. Filter Bi will return yes on 2N + 6c^N queries, 
corresponding to the number of true and false positives 
respectively. For an arbitrary /, filter Bi returns yes pre- 
cisely on the /c-mers inserted to Bi (i.e. /c-mers Bi is built 
on), and the /c-mers which are inserted to (which are 
the critical false positives for Bi), The counts then eas- 
ily follow from the analysis of Section "Memory and time 
usage". 

Table 3 provides counts for the first four filters, together 
with the estimated fraction of k-mers resolved by each fil- 
ter (last row), for the case of infinite number of filters. The 
data shows that 99.48% of all k-mers are resolved by four 
filters, which suggests that a very small number of filters is 
sufficient to cover a vast majority of k-mers. Furthermore, 
Table 4 shows data for 1-, 2- and 4-filter setups, this time 
with the optimal value of r for each case. Even two filters 
are already sufficient to reduce the accesses to T2 to 2.08%. 



Table 4 Estimated fractions of queries resolved by each 
filter and by the explicitely stored set Tffor t = 1,2,4, 
computed for k = 32 and optimal value of r shown in the 
second column 



Value of t 


r 


B^ 


B2 


B3 


B4 


Tt 


1 


11.44 


74.70% 


0 


0 


0 


25.3% 


2 


8.060 


73.44% 


24.48% 


0 


0 


2.08% 


4 


6.049 


70.90% 


23.63% 


3.88% 


1 .29% 


0.3% 



In case of four filters, 99.7% of k-mers are resolved before 
accessing 

Experimental results 

Construction algorithm 

In practice, constructing a cascading Bloom filter for a 
real-life read set is a computationally intensive step. To 
perform it on a commonly-used computer, the implemen- 
tation makes an essential use of external memory. Here we 
give a short description of the construction algorithm for 
up to four Bloom filters. Extension for larger number of 
filters is straightforward. 

We start from the input set Tq of /c-mers written on disk. 
In this set, for each pair of k-mer and its reverse com- 
plement we keep only one of them, the lexicographically 
smaller, and identify the other to it. We build the Bloom 
filter Bi of appropriate size by inserting elements of Tq 
successively. Next, all possible extensions of each k-mer in 
To are queried against Bi, and those which return true are 
written to the disk. Then, in this set only the k-mers absent 
from To are kept, i.e. we perform a set difference from Tq, 
We cannot afford to load Tq entirely in memory, so we 
partition Tq and perform the set difference in several iter- 
ations, loading only one partition of To each time. This 
results in the set Ti of critical false positives, which is also 
kept on disk. Up to this point, the procedure is identical to 
that of [9]. 

Next, we insert all /c-mers from Ti into B2 and to obtain 
T2, we check for each k-mer in Tq if a query to B2 returns 
true. This results in the set T2, which is directly stored on 
disk. Thus, at this point we have Bi, B2 and, by loading 
T2 from the disk, a complete representation for t = 2,ln 
order to build the data structure for ^ = 4, we continue 



Table 3 Estimations of the number of queries made to filters 81 , B2, Bs, 84 and of the fraction of queries resolved by each 
filter (for the optimal value r = 5.464), in the case of infinite number of filters 





B^ 


B2 


Bs 


fi4 


nb of queries 


8N 


(2 + 6c')N 


{6c' + 2c')N 


(2c' + 6c^')N 


Queries returning yes 


(2 + 6d)N 


(6c' + 2c')N 


{2d + 6c^')N 


(6c2' + 2c2')A/ 


Queries returning no 


(6 - 6d)N 


(2 - 2d)N 


(6c' - 6c^')N 


(2c' - 2c2')A/ 


Fraction of resolved queries 


69.57% 


23.19% 


5.04% 


1 .68% 
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this process, by inserting T2 in and retrieving (and 
writing directly on disk) Ts from Ti (stored on disk). It 
should be noted that to obtain Ti we need r/_2, and by 
always directly storing it on disk we guarantee not to use 
more memory than the size of the final structure. The 
set Tt (that is, Ti, T2 or T4 in our experiments) is repre- 
sented as a sorted array and is searched by a binary search. 
We found this implementation more efficient than a hash 
table. 

Implementation and experimental setup 

We implemented our method using MINIA software [9] 
and ran comparative tests for 2 and 4 Bloom filters {t = 
2, 4). Note that since the only modified part of MiNiA was 
the construction step and the k-mer membership queries, 
this allows us to precisely evaluate our method against the 
one of [9]. 

The first step of the implementation is to retrieve the 
list of /c-mers that appear more than d times using DSK 
[13] - a constant memory streaming algorithm to count 
k-mer s. Note, as a side remark, that performing count- 
ing allows us to perform off-line deletions of /c-mers. That 
is, if at some point of the scan of the input set of /c-mers 
(or reads) some of them should be deleted, it is done by a 
simple decrement of the counter. 

Assessing the query time is done through the proce- 
dure of graph traversal, as it is implemented in [9] . Since 
the procedure is identical and independent on the data 
structure, the time spent on graph traversal is a faithful 
estimator of the query time. 

We compare three versions: t = 1 (i.e. the version of 
[9]), t = 2 and t = 4. For convenience, we define 1 Bloom, 
2 Bloom and 4 Bloom as the versions with ^ = 1, 2 and 4, 
respectively. 

E.coli dataset, varying k 

In this set of tests, our main goal was to evaluate the influ- 
ence of the k-mer size on principal parameters: size of 
the whole data structure, size of the set Tt, graph traver- 
sal time, and time of construction of the data structure. 
We retrieved lOM E, coli reads of lOObp from the Short 
Read Archive (ERX008638) without read pairing informa- 
tion and extracted all k-mers> occurring at least two times. 
The total number of k-mers considered varied, depending 
on the value of /c, from 6,967,781 [k = 15) to 5,923,501 
(k = 63). We ran each version, 1 Bloom [9], 2 Bloom and 
4 Bloom, for values of k ranging from 16 to 64. The results 
are shown in Figure 1. 

The total size of the structures in bits per stored k- 
mer, i.e. the size of Bi and Ti (respectively, ^i,^2>^2 or 
BiyB2,B3,B4,T4) is shown in Figure 1(a). As expected, the 
space for 4 Bloom filters is the smallest for all values of k 
considered, showing a considerable improvement, ranging 
from 32% to 39%, over the version of [9]. Even the version 



with just 2 Bloom filters shows an improvement of at least 
20% over [9], for all values of k. Regarding the influence of 
the k-mer size on the structure size, we observe that for 
4 Bloom filters the structure size is almost constant, the 
minimum value is 8.60 and the largest is 8.89, an increase 
of only 3%. For 1 and 2 Bloom the same pattern is seen: a 
plateau from k = 16 to 32, a jump for k = 33 and another 
plateau from k = 33 to 64. The jump at k = 32 is due to 
switching from 64-bit to 128-bit representation of k-mer s 
in the table Tt, 

Figure 1(b) shows the size of table Tt (number of /c-mers) 
for ^ = 1, 2, 4, depending on k. It clearly demonstrates 
the sharp decrease of the size of Tt with growing t, in 
accordance with the exponential decrease estimated ana- 
lytically in Section "Memory and time usage". We also 
observe a decrease in the size of Tt with growing k for 
t = I and, to a smaller extent, for t = 2, while for ^ = 4 
the decrease is not noticeable. This is explained by the 
increase rate of optimal r (Table 1) which is high for ^ = 1, 
smaller for ^ = 2 and yet smaller for ^ = 4. Since the 
size of Tt is 0(Nc^^^'^) (Section "Memory and time usage") 
for c < 1 and almost invariable N, the decrease rate is 
exponential w.r.t. the increase rate of r. 

Traversal times for each version are shown in 
Figure 1(c). The fastest version is 4 Bloom, showing 
an improvement over [9] of 18% to 30%, followed by 2 
Bloom. This result is surprising and may seem counter- 
intuitive, as we have four filters to apply to the queried 
k-mer rather than a single filter as in [9]. However, the 
size of T4 (or even T2) is much smaller than Ti, as the size 
of T/s decreases exponentially. As Tt is stored in an array, 
the time economy in searching T4 (or T2) compared to 
Ti dominates the time lost on querying additional Bloom 
filters, which explains the overall gain in query time. 

As far as the construction time is concerned 
(Figure 1(d)), our versions yielded also a faster construc- 
tion, with the 4 Bloom version being 5% to 22% faster 
than that of [9]. The gain is explained by the time required 
for sorting the array storing Tt, which is much higher for 
To than for T2 or T4,, However, the gain is less significant 
here, and, on the other hand, was not observed for bigger 
datasets (see Section "Human dataset"). 

E coli dataset, varying coverage 

From the complete E. coli dataset (^44M reads) from 
the previous section, we selected several samples ranging 
from 5M to 40M reads in order to assess the impact of 
the coverage on the size of the data structures. This strain 
E, coli (K-12 MG1655) is estimated to have a genome of 
4.6M bp [14], implying that a sample of 5M reads (of 
lOObp) corresponds to ~100X coverage. We setd = 3 and 
k = 27. The results are shown in Figure 2. As expected, 
the memory consumption per k-mer remains almost con- 
stant for increasing coverage, with a slight decrease for 
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2 and 4 Bloom. The best results are obtained with the 4 
Bloom version, an improvement of 33% over the 1 Bloom 
version of [9] . On the other hand, the number of distinct k- 
mers increases markedly (around 10% for each 5M reads) 
with increasing coverage, see Figure 2(b). This is due to 
sequencing errors: an increase in coverage implies more 
errors with higher coverage, which are not removed by our 
cutoff d = 3, This suggests that the value of d should be 
chosen according to the coverage of the sample. Moreover, 
in the case where read qualities are available, a quality con- 
trol pre-processing step may help to reduce the number of 
sequencing errors. 

E coli dataset, query statistics 

In this set of tests we used the dataset of Section "E.coli 
dataset, varying k" to experimentally evaluate how the 
queries are distributed among the Bloom filters. We ran 



the graph traversal algorithm for each version, 1 Bloom 
[9], 2 Bloom and 4 Bloom, using values of k ranging from 
16 to 64 and retrieved the number of queries resolved in 
each Bloom filter and the table T^. The results are shown 
in Figure 3. The plots indicate that, for each version, the 
query distribution among the Bloom filters is approxi- 
mately invariant to the value of k. Indeed, on average 74%, 
73% and 70% of the queries are resolved in Bi for the 1, 
2 and 4 Bloom version, respectively, and the variance is 
smaller than 0.01% in each case. For the 4 Bloom version, 
70%, 24%, 4%, 1% and 0.2% of the queries are resolved in 
Bi, B2, B'i, B4 and T4, respectively, showing that the val- 
ues estimated theoretically in Section "Query distribution 
among filters" (the last row of Table 4) are very precise. 
Furthermore, as a query to a Bloom filter is faster than 
to Ti and the majority of the queries to 4 and 2 Bloom 
versions, 94% and 95% respectively, are resolved in the 
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first two filters, it is natural that on average queries to 
1 Bloom version are slower than to 2 and 4 Bloom ver- 
sions, corroborating the results of Section "Exoli dataset, 
varying k". 

Human dataset 

We also compared 2 and 4 Bloom versions with the 1 
Bloom version of [9] on a large dataset. For that, we 
retrieved 564M Human reads of lOObp (SRA: SRX016231) 
without pairing information and discarded the reads 
occurring less than 3 times. The dataset corresponds 
to ^17X coverage. A total of 2,455,753,508 /c-mers were 
indexed. We ran each version, 1 Bloom [9], 2 Bloom and 4 
Bloom with /c=23. The results are shown in Table 5. 

The results are in general consistent with the previ- 
ous tests on E.coli datasets. There is an improvement of 
34% (21%) for the 4 Bloom (2 Bloom) in the size of the 
structure. The graph traversal is also 26% faster in the 
4 Bloom version. However, in contrast to the previous 
results, the graph construction time increased by 10% and 
7% for 4 and 2 Bloom versions respectively, when com- 
pared to the 1 Bloom version. This is due to the fact that 
disk writing/reading operations now dominate the time 
for the graph construction, and 2 and 4 Bloom versions 
generate more disk accesses than 1 Bloom. As stated in 
Section "Construction algorithm", when constructing the 
1 Bloom structure, the only part written on the disk is Ti 
and it is read only once to fill an array in memory. For 
4 Bloom, Ti and T2 are written to the disk, and Tq and 
Ti are read at least one time each to build B2 and ^3. 
Moreover, since the size coefficient of Bi reduces, from 
r = 11.10 in 1 Bloom to r = 5.97 in 4 Bloom, the number 
of false positives in Ti increases. 

Discussion and conclusions 

Using cascading Bloom filters for storing de Bruijn graphs 
has clear advantages over the single-filter method of [9]. 



Table 5 Results of 1 , 2 and 4 Bloom filters version for 564M 
Human reads of 1 00 bp using k = 23 



Method 


1 Bloom [9] 


2 Bloom 


4 Bloom 


Construction time (s) 


40160.7 


43362.8 


44300.7 


Traversal time (s) 


46596.5 


35909.3 


34177.2 


r coefficient 


11.10 


7.80 


5.97 




B^ = 3250.95 


5] = 2283.64 


5i = 1 749.04 


Bloom filters size (MB) 




B2 = 323.08 


B2 =591.57 
83 = 100.56 
B4 = 34.01 


False positive table 
size (MB) 


7"i = 545.94 


72 = 425.74 


74 = 36.62 


Total size (MB) 


3796.89 


3032.46 


2511.8 


Size (bits/^-mer) 


12.96 


10.35 


8.58 



In terms of memory consumption, which is the main 
parameter here, we obtained an improvement of around 
30%-40% in all our experiments. Our data structure takes 
8.5 to 9 bits per stored /c-mer, compared to 13 to 15 
bits by the method of [9]. This confirms our analytical 
estimations. The above results were obtained using only 
four filters and are very close to the estimated optimum 
(around 8.4 bits//c-mer) produced by the infinite number 
of filters. This is consistent with both our analytical esti- 
mations and experimental data showing that over 99% of 
queries are resolved by the four filters, without resort- 
ing to the explicitely stored set 7^. Even two filters only 
resolve about 95% of queries. An interesting characteris- 
tic of our method is that the memory grows insignificantly 
with the growth of even slower than with the method of 
[9]. Somewhat surprisingly, we also obtained a significant 
decrease, of order 20%-30%, of query time. The construc- 
tion time of the data structure varied from being 10% 
slower (for the human dataset) to 22% faster (for the bac- 
terial dataset). Cascading Bloom filters have now been 
implemented by default in the MINI A software [15]. 

As stated previously, another compact encoding of de 
Bruijn graphs has been proposed in [10], however no 
implementation of the method was made available. For 
this reason, we could not experimentally compare our 
method with the one of [10]. We remark, however, that 
the space bound of [10] heavily depends on the number 
of reads (i.e. coverage), while in our case, the data struc- 
ture size is almost invariant with respect to the coverage 
(Section coli dataset, varying coverage"). 

An interesting open question is whether the Bloom fil- 
ter construction can be made online, so that new /c-mers 
(reads) can be inserted without reconstructing the whole 
data structure from scratch. Note that the presented con- 
struction (Section "Construction algorithm") is inherently 
off-line, as all /c-mers should be known before the data 
structure is built. 

Another interesting prospect for possible further 
improvements of our method is offered by work [16], 
where an efficient replacement to Bloom filter was intro- 
duced. The results of [16] suggest that we could hope 
to reduce the memory to about 5 bits per k-mei. How- 
ever, there exist obstacles on this way: an implementation 
of such a structure would probably result in a significant 
construction and query time increase. 

Endnotes 

^Note that this is actually a subgraph of the de Bruijn 
graph under its classical combinatorial definition. 
However, we still call it de Bruijn graph to follow the 
terminology common to the bioinformatics literature. 

^By a slight abuse of notation, we also yiewBj as the set 
of all k-mers on which the filter Bj returns the positive 
answer. 
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